
[Intel MKL] Use Shard function instead of Eigen device to parallelize Adam kernel. #26424

Conversation

@Zantares (Contributor) commented Mar 7, 2019

This could reduce memory accesses and improve cache locality on CPU.

modified:

  • tensorflow/core/kernels/training_ops.cc
  • tensorflow/core/kernels/training_ops.h
  • tensorflow/core/kernels/training_ops_gpu.cu.cc

Signed-off-by: Lu Teng teng.lu@intel.com

@Zantares (Contributor, Author) commented Mar 7, 2019

An Eigen device expression can only update one variable at a time, but Adam needs to update three variables, so it used three expressions, which hurts CPU cache locality. This change uses the Shard function in place of the Eigen device expressions.

This patch was tested on the NCF model from the MLPerf 0.5 submission. It speeds up the Adam kernel by 15%~30% and improves overall model performance by 10%~20%.
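
For illustration, here is a minimal standalone sketch of the fused shard body (the function name and signature are hypothetical, not the exact PR code): all three Adam state updates happen in a single pass, so each element of grad, m, v, and var is touched once per step instead of once per Eigen expression.

#include <cmath>
#include <cstdint>

// Hypothetical shard body: the sharding helper calls it with disjoint
// [begin, end) ranges on different threads. alpha is the bias-corrected
// learning rate lr * sqrt(1 - beta2^t) / (1 - beta1^t).
void AdamShard(int64_t begin, int64_t end, float alpha, float beta1,
               float beta2, float epsilon, const float* grad, float* m,
               float* v, float* var) {
  for (int64_t i = begin; i < end; ++i) {
    // One pass updates m, v, and var while the element is hot in cache.
    m[i] += (grad[i] - m[i]) * (1.0f - beta1);
    v[i] += (grad[i] * grad[i] - v[i]) * (1.0f - beta2);
    var[i] -= (m[i] * alpha) / (std::sqrt(v[i]) + epsilon);
  }
}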

@Zantares Zantares changed the title Use Shard function instead of Eigen device to parallelize Adam kernel. [Intel MKL]Use Shard function instead of Eigen device to parallelize Adam kernel. Mar 7, 2019
@Zantares Zantares changed the title [Intel MKL]Use Shard function instead of Eigen device to parallelize Adam kernel. [Intel MKL] Use Shard function instead of Eigen device to parallelize Adam kernel. Mar 7, 2019
@rthadur rthadur requested a review from yifeif March 7, 2019 20:58
@rthadur rthadur added awaiting review Pull request awaiting review size:M CL Change Size: Medium labels Mar 7, 2019
@rthadur rthadur added this to Assigned Reviewer in PR Queue via automation Mar 7, 2019
@rthadur rthadur self-assigned this Mar 7, 2019
@agramesh1 (Contributor) commented

pinging @ezhulenev for a review. Thanks.

@ezhulenev (Member) left a comment

Could you please also add a benchmark similar to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/bias_op_test.cc? I'll need it to run performance testing internally.

tensorflow/core/kernels/training_ops.cc (review thread, resolved)
tensorflow/core/kernels/training_ops.cc (review thread, outdated, resolved)
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Mar 12, 2019
To get better cache locality, use Shard instead of Eigen expression.
Also added a benchmark to test Adam performance.
@Zantares (Contributor, Author) commented

> Could you please also add a benchmark similar to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/bias_op_test.cc? I'll need it to run performance testing internally.

Hi @ezhulenev, I've refined the code and added a benchmark: https://github.com/tensorflow/tensorflow/pull/26424/files#diff-0b9bd0c5daec98f25d2e15c9b8c0370cR200.

My test results are:

original:
Benchmark Time(ns) Iterations Throughput Items/s

BM_SGD/131072 58816 10000 8914.0MB/s 2228.5M items/s
BM_SGD/262144 107837 6335 9723.7MB/s 2430.9M items/s
BM_Adagrad/131072 136092 5126 3852.5MB/s 963.1M items/s
BM_Adagrad/262144 216470 3093 4844.0MB/s 1211.0M items/s
BM_Momentum/131072 126981 5114 4128.9MB/s 1032.2M items/s
BM_Momentum/262144 206378 3434 5080.9MB/s 1270.2M items/s
BM_Adam/131072/0 194416 3452 2696.7MB/s 674.2M items/s
BM_Adam/262144/0 334504 2110 3134.7MB/s 783.7M items/s
BM_Adam/16777216/1 9864090 100 6803.4MB/s 1700.8M items/s
BM_RMSProp/131072 187562 3545 2795.3MB/s 698.8M items/s
BM_RMSProp/262144 334770 2180 3132.2MB/s 783.1M items/s
BM_AddSign/131072 512574 1449 1022.9MB/s 255.7M items/s
BM_AddSign/262144 922186 727 1137.1MB/s 284.3M items/s
BM_PowerSign/131072 2179936 311 240.5MB/s 60.1M items/s
BM_PowerSign/262144 3963514 177 264.6MB/s 66.1M items/s

optimized:
Benchmark Time(ns) Iterations Throughput Items/s

BM_SGD/131072 69680 9636 7524.2MB/s 1881.0M items/s
BM_SGD/262144 95855 5395 10939.2MB/s 2734.8M items/s
BM_Adagrad/131072 158376 5181 3310.4MB/s 827.6M items/s
BM_Adagrad/262144 234968 2831 4462.6MB/s 1115.7M items/s
BM_Momentum/131072 118026 5495 4442.1MB/s 1110.5M items/s
BM_Momentum/262144 215430 3169 4867.4MB/s 1216.8M items/s
BM_Adam/131072/0 202429 3486 2590.0MB/s 647.5M items/s
BM_Adam/262144/0 328765 1965 3189.4MB/s 797.4M items/s
BM_Adam/16777216/1 7393820 100 9076.3MB/s 2269.1M items/s
BM_RMSProp/131072 187683 3003 2793.5MB/s 698.4M items/s
BM_RMSProp/262144 372268 2006 2816.7MB/s 704.2M items/s
BM_AddSign/131072 648737 1000 808.2MB/s 202.0M items/s
BM_AddSign/262144 876228 764 1196.7MB/s 299.2M items/s
BM_PowerSign/131072 2178264 322 240.7MB/s 60.2M items/s
BM_PowerSign/262144 3908483 174 268.3MB/s 67.1M items/s


There may be some run-to-run variance between executions, but you can see that the optimized version generally outperforms the original.

@Zantares (Contributor, Author) commented

BTW, the Tensor vectorization form has similar performance to the loop version. I guess Eigen may generate the same assembly internally, but I didn't dig too deep.
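
For context, a hedged sketch of the two forms being compared (using Eigen's Array Map as a stand-in; the function names are illustrative, not the PR code):

#include <cstdint>
#include <Eigen/Core>

// (a) Expression form over a mapped slice of the shard.
void UpdateMExpr(float* m, const float* g, int64_t begin, int64_t end,
                 float beta1) {
  Eigen::Map<Eigen::ArrayXf> m_s(m + begin, end - begin);
  Eigen::Map<const Eigen::ArrayXf> g_s(g + begin, end - begin);
  m_s += (g_s - m_s) * (1.0f - beta1);
}

// (b) Plain scalar loop; per the comment above it benchmarks about the
// same, presumably because the compiler auto-vectorizes it after inlining.
void UpdateMLoop(float* m, const float* g, int64_t begin, int64_t end,
                 float beta1) {
  for (int64_t i = begin; i < end; ++i) {
    m[i] += (g[i] - m[i]) * (1.0f - beta1);
  }
}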

tensorflow/core/kernels/training_ops.cc (review thread, outdated, resolved)
Code under review (excerpt from tensorflow/core/kernels/training_ops.cc):

  length = length / size;
} else {
  size = 1;
}
@ezhulenev (Member) commented:

There is no need to divide the input size by the packet size and do "manual vectorization". If it's desirable to have the shard size (end - begin) be a multiple of the packet size, you can pass block_align to parallelFor (see https://bitbucket.org/eigen/eigen/src/4b28c8008901c6d760f48f26ee2e3423fd8a2b40/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h#lines-185).

I think this should work:

[packet_size](Index index) -> Index { return Eigen::divup(index, packet_size); }

@Zantares (Contributor, Author) replied:

I ran into some questions when trying to use this function; please see my comment below.

tensorflow/core/kernels/training_ops_test.cc (review thread, outdated, resolved)
PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Mar 14, 2019
@ezhulenev (Member) commented

I guess that after inlining it all might have been fused into a single loop by the compiler. Anyway, it's great that there is no performance difference and we can keep the simpler code.

@Zantares (Contributor, Author) commented Mar 15, 2019

When I tried to use block_align to align the shard size, I found that performance decreased on a real model, so I captured the parameter sizes from the model and made a small benchmark: e4dae32#diff-0b9bd0c5daec98f25d2e15c9b8c0370cR200.

env: Intel Xeon Skylake-8180, 56 cores
cmd: numactl -N 0 -l bazel run --config=mkl --copt=-mavx2 --copt=-mfma --copt=-march=broadwell --copt=-O2 --copt=-L$HOME/code/1/gcc6/gcc6.3/lib64/ -- //tensorflow/core/kernels:training_ops_test -- --benchmarks=..

current implementation:

BM_Adam/8192/1 82133 8295 399.0MB/s 99.7M items/s
BM_Adam/16777216/1 8501990 100 7893.3MB/s 1973.3M items/s

with block_align:

BM_Adam/8192/1 88312 7707 371.0MB/s 92.8M items/s
BM_Adam/16777216/1 8462090 100 7930.5MB/s 1982.6M items/s


With "manual vectorization", the small benchmark will get better performance(+10%). It's really confused me, maybe Eigen efficiency model https://bitbucket.org/eigen/eigen/src/4b28c8008901c6d760f48f26ee2e3423fd8a2b40/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h?fileviewer=file-view-default#TensorDeviceThreadPool.h-188 can't handle small size with block_align well?


How I use block_align

+    // Set a function to align block size to packet size, which can get more
+    // chance to vectorize.
+    auto block_align = [packet_size](Index block_size) -> Index {
+      return Eigen::divup(block_size, packet_size) * packet_size;
+    };
+    d.parallelFor(length, cost, block_align, shard);

block_align receives the block_size computed by the Eigen efficiency model; it allows us to round the size up with our own rule and return it as the new block size. I must increase the block size in block_align, or it will get trapped in an infinite loop.
Based on the results, I prefer the "manual vectorization" version. What do you think about this situation?

@ezhulenev (Member) commented

That's strange. I'll try to reproduce it internally after it's merged.

@ezhulenev previously approved these changes Mar 15, 2019
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Mar 15, 2019
@ezhulenev (Member) commented

I think the problem is an incorrectly computed cost, causing Eigen to shard too much or too little.

PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes Mar 19, 2019
@Zantares (Contributor, Author) commented Mar 19, 2019

> I think the problem is an incorrectly computed cost, causing Eigen to shard too much or too little.

Hi @ezhulenev, please take a look at the new commit. I fixed an error in the cost computation: compute_cycles needs to be multiplied by the length. I also checked the CostModel in Eigen; it already has some estimation of cache behavior, so I added the store cost. This should be the last commit if there are no more review suggestions.

The "manual vectorization" is still better than block_align, I guess because the compiler can get more static information from "manual vectorization", while block_align may generate a tail that can't be divided evenly at run time.
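
For reference, a hedged sketch of how a per-unit cost is handed to Eigen's parallelFor (the byte and cycle constants are illustrative, not the exact values in the PR; Eigen's cost model uses them to pick the shard size):

#define EIGEN_USE_THREADS
#include <functional>
#include <unsupported/Eigen/CXX11/Tensor>

// Hypothetical wrapper: 'shard' is the fused Adam update over [begin, end).
void RunSharded(const Eigen::ThreadPoolDevice& d, Eigen::Index length,
                std::function<void(Eigen::Index, Eigen::Index)> shard) {
  // Per-unit cost: reads of grad/m/v/var, writes of m/v/var, plus an
  // estimated cycle count that includes the stores discussed above.
  const double bytes_loaded = 4 * sizeof(float);
  const double bytes_stored = 3 * sizeof(float);
  const double compute_cycles = 5 * Eigen::TensorOpCost::AddCost<float>() +
                                2 * Eigen::TensorOpCost::MulCost<float>();
  const Eigen::TensorOpCost cost(bytes_loaded, bytes_stored, compute_cycles);
  d.parallelFor(length, cost, std::move(shard));
}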

PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Mar 19, 2019
@rthadur rthadur added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 19, 2019
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 19, 2019
@tensorflow-copybara tensorflow-copybara merged commit 2160c84 into tensorflow:master Mar 20, 2019
PR Queue automation moved this from Approved by Reviewer to Merged Mar 20, 2019
tensorflow-copybara pushed a commit that referenced this pull request Mar 20, 2019
@Zantares Zantares deleted the Intel-TF/tenglu/fuse_adam branch March 21, 2019 00:48
Labels: cla: yes, ready to pull, size:M
Projects: PR Queue (Merged)
8 participants